Latest Update: 2021-12-09 20:59:55
Heart disease is the one of the leading cause of death for men, women, and people of different races in the United States. The term “heart disease” includes several types of heart conditions. In the United States, the most common type of heart disease is coronary artery disease (CAD), which can lead to heart attack. In fact, one person dies every 36 seconds in the United States from cardiovascular disease.About 659,000 people in the United States die from heart disease each year, which is in every 4 deaths. Heart disease costs the United States a lot each year, which includes the cost of health care services, medicines, and lost productivity due to death. It would be helpful to examine the pattern of heart disease mortality rate under gender and race stratification to understand the behind association to some extent.
The data was obtained from CDC chronic disease and health promotion data & indicators: https://chronicdata.cdc.gov/Heart-Disease-Stroke-Prevention/Heart-Disease-Mortality-Data-Among-US-Adults-35-by/i2vk-mgdh, and it was organized by National Vital Statistics System and focused on heart disease mortality data in US during 2014. The data was collected in county level. Here were the basic information of this dataset:
Since it might be too vague to observe the mortality rate from the entire country level, a focus on California data is included in this investigation. The FIPS code of each county was obtained from https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv.
The main question is: How were gender and races associated with heart disease death rate in California during 2014?
To better analyze the question step-by-step, several sub-questions are added
The data set included the following important variables which will be used later:
The original data source was included in the Method part.
heartdisease <- "/Users/hms/Desktop/PM566-project/Final/Data/Heart_Disease_Mortality_Data_Among_US_Adults__35___by_State_Territory_and_County.csv"
heartdisease <- data.table::fread(heartdisease)
The data set was cleaned according to EDA checklists. There were several “NAs” under “Data_value” variable, which was due to insufficient information. It was replaced by median value of mortality rate, which was used for later data visualization. Since the data set included data from entire country, only data in California was selected for analysis. After data wrangling, datasets CA_race and CA_gender were created and each of them was merged with FIPS code, whose source was introduced in the Introduction session.
The gender data included variables: county name as LocationDesc, male mortality rate as value_male, female mortality rate as value_female and FIPS code for each county as fips.
The race data included variables: county name as LocationDesc, the White mortality rate as value_white, the Hispanic mortality rate as value_hispanic, the Black mortality rate as value_black, the Asian and Pacific Islander mortality rate as value_asian_pacific, the American Indian and Alaskan Native mortality rate as value_indian_alaskan and FIPS code for each county as fips.
# select data in California
heartdisease_CA <- heartdisease[LocationAbbr == 'CA' & GeographicLevel == 'County']
# convert data into num format
heartdisease_CA$Data_Value <- as.numeric(heartdisease_CA$Data_Value)
# select data under each stratification
CA_gender <- heartdisease_CA[Stratification1 != 'Overall' & Stratification2 == 'Overall']
CA_gender$Data_Value[is.na(CA_gender$Data_Value)] <- median(CA_gender$Data_Value, na.rm = TRUE)
CA_race <- heartdisease_CA[Stratification2 != 'Overall' & Stratification1 == 'Overall']
CA_race$Data_Value[is.na(CA_race$Data_Value)] <- median(CA_race$Data_Value, na.rm = TRUE)
# gender data handling
CA_male <- CA_gender[Stratification1 == 'Male']%>%
select(LocationDesc, Data_Value)%>%
rename(value_male = Data_Value)
CA_female <- CA_gender[Stratification1 == 'Female'] %>%
select(LocationDesc, Data_Value)%>%
rename(value_female = Data_Value)
gender_joint <- merge(CA_male, CA_female, by.x = "LocationDesc",
by.y = "LocationDesc", all.x = TRUE, all.y = FALSE)
gender_joint$Gap <- (gender_joint$value_male - gender_joint$value_female)
df <- read.csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv')
fips <- filter(df,state == "CA")
CA_gender1 <- merge(gender_joint, fips, by.x = "LocationDesc",
by.y = "name", all.x = TRUE, all.y = FALSE)
CA_gender1 <- CA_gender1 %>%
mutate(fips = ifelse(row_number()>= 1,paste0("0", fips)))
url <- 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties <- rjson::fromJSON(file=url)
# Race data handling
CA_white <- CA_race[Stratification2 == 'White']%>%
select(LocationDesc, Data_Value)%>%
rename(value_white = Data_Value)
CA_hispanic <- CA_race[Stratification2 == 'Hispanic'] %>%
select(LocationDesc, Data_Value) %>%
rename(value_hispanic = Data_Value)
CA_black <- CA_race[Stratification2 == 'Black'] %>%
select(LocationDesc, Data_Value) %>%
rename(value_black = Data_Value, )
CA_asian_pacific <- CA_race[Stratification2 == 'Asian and Pacific Islander'] %>%
select(LocationDesc, Data_Value) %>%
rename(value_asian_pacific = Data_Value)
CA_indian_alaskan <- CA_race[Stratification2 == 'American Indian and Alaskan Native']%>%
select(LocationDesc, Data_Value) %>%
rename(value_indian_alaskan = Data_Value)
data_list <- list(CA_white, CA_hispanic, CA_black, CA_asian_pacific, CA_indian_alaskan)
CA_race1 <- data_list %>% reduce(inner_join, by = "LocationDesc")
CA_race1 <- merge(CA_race1, fips, by.x = "LocationDesc",
by.y = "name", all.x = TRUE, all.y = FALSE)
CA_race1 <- CA_race1 %>%
mutate(fips = ifelse(row_number()>= 1,paste0("0", fips)))
The visualization of data was included in this part which replied to the sub-questions above. The results corresponded to the general patterns of stratification, differences within stratification and geographical distributions.
CA_gender and CA_race to create distribution graphs for two stratificationsggplotly to obtain the interactive visualization# create distribution graph graph to find association between gender and death rate
p1 <- ggplot(CA_gender, aes(Data_Value, fill = Stratification1))+
geom_density(alpha = 0.5) +
scale_fill_brewer(palette = "Set3") +
labs(
x = "death rate per 100,000 population",
y = "Density",
title = "Distribution of death rate by gender in CA")
p1 <- ggplotly(p1)
# create distribution graph to find association between race and death rate
p2 <- ggplot(CA_race, aes(Data_Value, fill = Stratification2))+
geom_density(alpha = 0.5) +
scale_fill_brewer(palette = "Set3") +
labs(
x = "death rate per 100,000 population",
y = "Density",
title = "Distribution of death rate by race in CA")
p2 <- ggplotly(p2)
For graph of death rates under gender stratification, the distribution of male group was on the right compared to female group in general, which indicated that the male had relatively higher heart disease death rate than the female.
For graph of death rates under race stratification, the distribution of the White, the Black and the American Indian and Alaskan Native were on the right compared to the Hispanic and the Asian and Pacific Islander. It indicated that the White, the Black and the American Indian and Alaskan Native had relatively higher heart disease death rate than the Hispanic and the Asian and Pacific Islander.
Gap in dataset gender_jointGap by bubble charts via plotly for each countyfig_gendergap <- plot_ly(gender_joint, x = ~value_male, y = ~value_female, type = 'scatter', mode = 'markers',size = ~Gap, color = ~LocationDesc, colors = 'Paired',
sizes = c(5, 45),
marker = list(opacity = 0.5, sizemode = 'diameter'),
text=~paste(paste("County: ", LocationDesc),
paste("Death rate/100,000(male):", value_male),
paste("Death rate/100,000(female):", value_female),
paste("Death rate/100,000(gap):", Gap),
sep="<br>"),hoverinfo="text")
fig_gendergap <- fig_gendergap %>%
layout(title = 'Gender Gap on heart disease death rate among CA county',
xaxis = list(title = 'Mortality rate/100,000 population (male)', showgrid = FALSE),
yaxis = list(title = 'Mortality rate/100,000 population (female)', showgrid = FALSE))
fig_gendergap
In the bubble chart, the mortality rate for male group lied on the x-axis and the mortality rate for female group lied on the y-axis. The differences between rates was shown via the size of bubble. The label of each bubble included mortality rate gaps and corresponding county names. From the chart, most counties exhibited obvious gaps between mortality rates, which was consistent as the results in gender distribution graph above. Among all counties, Butte County(208/100,000 population), Plumas County(206.8/100,000 population) and Yuba County(220.4/100,000 population) seemed to have greatest differences.
mortality rate map via plotly to show gradient level of mortality rate under gender stratificationChoropleth map## Pattern map for male group
fig_male <- plot_ly( text=~paste(paste("County: ", CA_gender1$LocationDesc),
paste("Death rate/100,000:", CA_gender1$value_male),
sep="<br>"),hoverinfo="text")
fig_male <- fig_male %>% add_trace(
type="choroplethmapbox",
geojson = counties,
locations = CA_gender1$fips,
z = CA_gender1$value_male,
colorscale="Cividis",
reversescale = TRUE,
zmin = 150,
zmax = 500,
marker=list(line=list(
width=0),
opacity=0.5))%>%
layout(
mapbox=list(
style="carto-positron",
zoom =4,
center=list(lon= -119.42, lat=36.78)))
fig_female <- plot_ly( text=~paste(paste("County: ", CA_gender1$LocationDesc),
paste("Death rate/100,000:", CA_gender1$value_female),
sep="<br>"),hoverinfo="text")
fig_female <- fig_female %>% add_trace(
type="choroplethmapbox",
geojson = counties,
locations = CA_gender1$fips,
z = CA_gender1$value_female,
colorscale="Cividis",
reversescale=TRUE,
zmin = 150,
zmax = 500,
marker=list(line=list(
width=0),
opacity=0.5))%>%
layout(
mapbox=list(
style="carto-positron",
zoom =4,
center=list(lon= -119.42, lat=36.78)))